ABSTRACT

Nearest neighbor searching (NNS) is a common classification
method, but its brute-force (BF) implementation is inefficient for
dimensions greater than 10. We present Cellular Class Encoding
(CCE), shown to be 1.1 to 1.7 times faster than BF on real-world,
14-dimensional data sets. Moreover, if applied to bounded sets, CCE
is a full-search equivalent to BF.

Given  a  query  in  an  indexed  cell  of  a  partitioned  bounded 
space,  the  CCE’s  efficiency  is  achieved  by  only  performing  NNS 
on  those  database  elements  which  could  not  be  eliminated  a  priori 
as  impossible  nearest  neighbors  of  that  cell’s  resident  vectors.  To 
ensure  CCE  is  a  viable  alternative  in  real-world  applications,  we 
use  VQ  speaker  identification  as  a  testbed  application  and  present 
results. 

Index  Terms —  Nearest  neighbor  search,  vector  quantization, 
cellular  class  encoding,  vector  approximation-file 

1.  INTRODUCTION 

Nearest neighbor searching (NNS) is a common classification
method in pattern recognition systems such as Vector Quantization
(VQ). The main drawback of this approach is that the full-search
(a.k.a. brute-force, BF) technique is very inefficient for
dimensions greater than 10, a problem that is only compounded when
large database sizes are involved. Alternative full-search
equivalents (FSE) to BF do exist; one of the more successful of
these reported is the Approximating and Eliminating Search
Algorithm (AESA) [1] and its derivatives such as Linear AESA [2]
and Reduced Overhead AESA [3]. In our tests, none of these three
alternatives was any faster than BF for our testbed application of
VQ speaker ID.

The Vector Approximation-File (VA-F) method presented in
[4] is a viable FSE alternative in real-world applications. Our
implementation of this also failed to be any faster than BF;
however, there were some useful ideas in VA-F that we felt could
be modified for our purposes, thus inspiring the Cellular Class
Encoding (CCE) method.

Given  any  test  vector  in  an  indexed  cell  of  a  partitioned  space, 
the  CCE’s  efficiency  is  achieved  by  only  performing  NNS  on  those 
database  elements  which  could  not  be  eliminated  a  priori  as 
impossible  nearest  neighbors  of  any  vector  residing  in  that  cell.  It 
is  the  indices  of  the  non-eliminated  database  elements  which  form 
the class for that cell. To ensure CCE was a viable alternative in
real-world applications, we used VQ speaker identification (SID)
as a testbed application; results are reported on populations
ranging from 10 to 1500 enrolled speakers.

In addition to its use in SID, VQ is an essential component of
many audio compression techniques. For example, CELP and
systems that use CELP such as SPEEX, G.729, TwinVQ, Vorbis,
AMR-WB+, and DTS all rely on VQ codebook searches to enable
their lossy compression techniques. Hence, a full-search
equivalent that is more efficient than BF for dimensions greater
than 10 will find broader applicability than just SID.

2.  VECTOR  APPROXIMATION-FILE  DESCRIPTION 

Henceforth, we shall use the following notation: $v_q$ to denote a
query vector, $v_i$ for the i-th vector of the database, $v_{i,j}$
for the j-th component of $v_i$, $b_j$ for the number of bits for
approximating values in the j-th dimension, $p_j[k]$ for the k-th
partition point in the j-th dimension, and $r_{i,j}$ for the region
in which $v_i$ resides in the j-th dimension, or alternatively
$r_{i,j}[n]$ shall mean $r_{i,j} = n$.

In the VA-F method, b bits are allotted to partition a bounded
region of a J-dimensional vector space into $2^b$ uniform cells.
Each cell is uniquely bit-encoded, which serves to quantize its
resident vectors. To aid our description, we summarize the example
given in [4]. In this example, 3 bits encode the partitioning of a
bounded plane region such that 2 bits encode the x-dimension's
partitioning into 4 regions and 1 bit encodes the partitioning of
the y-dimension into 2 regions (see Figure 1). For a generic query
vector $v_q = (v_{q,0}, v_{q,1})$, the partition points $p_0[2]$ and
$p_0[3]$ which bound region $r_{q,0}[2]$ are such that
$p_0[2] \le v_{q,0} < p_0[3]$. Thus, we say component $v_{q,0}$
resides in region $r_{q,0}[2]$ and encode component $v_{q,0}$'s
approximation as the binary equivalent of this region's decimal
index 2. That is, we encode $v_{q,0}$'s approximation as the
bit-string '10'. Likewise, the partition points $p_1[1]$ and
$p_1[2]$ which bound $r_{q,1}[1]$ are such that
$p_1[1] \le v_{q,1} < p_1[2]$. Hence, component $v_{q,1}$ resides in
region $r_{q,1}[1]$ and we encode its approximation as the
bit-string '1'. Thus, the bit-encoded vector approximation for
$v_q$ is just the concatenation of its resident regions'
bit-encodings, yielding the bit-string '101'. Finally, the VA-File
for a database of N vectors is one long array of all N vectors'
bit-encoded approximations, plus supporting information such as the
number of bits, $b_j$, per approximation in the j-th dimension and
the partition points $p_j[0], \ldots, p_j[2^{b_j}]$ that define the
j-th dimension's partition.
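As a minimal sketch of the encoding just described, the following reproduces the '101' example. The partition point values, the query coordinates and the function names are hypothetical and chosen only for illustration; the sketch also assumes every component lies inside the bounded region.

```python
from bisect import bisect_right

def encode_component(value, points):
    """Index n of the region with points[n] <= value < points[n + 1]."""
    return bisect_right(points, value) - 1

def encode_vector(vector, partitions, bits):
    """Concatenate the fixed-width binary region index of every component."""
    return "".join(
        format(encode_component(v, pts), "0{}b".format(b))
        for v, pts, b in zip(vector, partitions, bits)
    )

# Hypothetical partition points: 2 bits (4 regions) for x, 1 bit (2 regions) for y.
partitions = [[0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0]]
bits = [2, 1]

# x = 2.5 falls in region 2 ('10'), y = 1.5 falls in region 1 ('1').
approximation = encode_vector((2.5, 1.5), partitions, bits)
print(approximation)  # 101
```

Storing the region index at a fixed width per dimension is what lets the concatenated bit-string double as a cell index.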

In subsequent database searches for the k nearest neighbors of
a query vector $v_q$, we sequentially scan each database element
$v_i$'s approximation in the VA-File, while simultaneously computing
a lower bound $l_i$ and upper bound $u_i$ for the $L_p$-distance
between $v_q$ and $v_i$.

These  bounds  are  computed  as 
Hence, a database element $v_i$ is a nearest neighbor (NN) candidate
if fewer than k vectors have been scanned thus far or if its lower
bound $l_i$ is less than the k-th smallest upper bound in the
current candidate set. Once all vector approximations in the
database have been scanned, the actual $L_p$-distance is computed
only on those elements of the final candidate set. It is this
filtering out of database elements which gives the boost in
efficiency in this NNS approach.

Fig. 1. Example of using a 3-bit allocation to encode
the partitioning of the xy-plane.
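The scan-and-filter step can be sketched as follows. This is an illustration under stated assumptions, not the exact bound formulas of [4]: it uses the L2-distance, and it forms each bound per dimension from the partition points enclosing the database element's region (the gap to the region for the lower bound, the farthest region edge for the upper bound). All names and data values are hypothetical.

```python
import heapq
import math

def distance_bounds(query, regions, partitions):
    """Lower/upper L2 bounds between query and any vector in the given regions."""
    lower_sq = upper_sq = 0.0
    for q, r, points in zip(query, regions, partitions):
        lo, hi = points[r], points[r + 1]
        lower_sq += max(lo - q, 0.0, q - hi) ** 2       # gap to the region, 0 if inside
        upper_sq += max(abs(q - lo), abs(q - hi)) ** 2  # farthest edge of the region
    return math.sqrt(lower_sq), math.sqrt(upper_sq)

def scan_candidates(query, approximations, partitions, k=1):
    """Keep elements whose lower bound beats the k-th smallest upper bound so far."""
    smallest_uppers = []  # negated values: a max-heap of the k smallest upper bounds
    candidates = []
    for i, regions in enumerate(approximations):
        lower, upper = distance_bounds(query, regions, partitions)
        if len(smallest_uppers) < k or lower < -smallest_uppers[0]:
            candidates.append(i)
            heapq.heappush(smallest_uppers, -upper)
            if len(smallest_uppers) > k:
                heapq.heappop(smallest_uppers)  # drop the largest stored upper bound
    return candidates

partitions = [[0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0]]
database = [(0.5, 0.5), (2.6, 1.4), (3.5, 0.2), (0.3, 1.8)]
approximations = [(0, 0), (2, 1), (3, 0), (0, 1)]  # region indices of each element
query = (2.8, 1.5)

candidates = scan_candidates(query, approximations, partitions)
# Exact distances are computed only for the surviving candidates.
nearest = min(candidates, key=lambda i: math.dist(query, database[i]))
print(candidates, nearest)  # [0, 1, 2] 1
```

In this toy run element 3 is never compared exactly, because its lower bound already exceeds the best upper bound seen.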


3.  CELLULAR  CLASS  ENCODING  DESCRIPTION 


As  a  candidate  alternative  to  the  brute-force  (BF)  method  of  NNS, 
we  were  hopeful  when  we  applied  the  VA-F  to  the  real-world  task 
of  VQ  SID.  We  were  surprised  and  disappointed  when  BF 
outperformed  the  VA-F  method.  It  is  unclear  why  this  is  so,  but 
we  surmise  that  there  are  some  less  understood  correlations  in  real- 
world  data  of  this  type  which  reduce  the  VA-F’s  filtering  capacity 
during  the  NNS.  These  disappointing  results  motivated  us  to 
develop  the  Cellular  Class  Encoding  (CCE)  method  of  NNS. 

3.1.  Overview 

The CCE method is an extension of VA-F in that we borrow from
it the techniques of space partitioning, bit-encoded approximations
and computation of the lower/upper distance bounds. What
differentiates CCE from VA-F is that the distance bound
computations are only performed offline and not at query time.
Additionally, in the offline stage the distance bounds are used to
map a cell of the partitioned space to a subset of database
elements that cannot be eliminated as impossible NNs of any vector
residing in that cell. Consequently, during future NN queries we
avoid the lower/upper bound computations required to find a query
vector's NN candidates. Instead, we only need to determine the
query vector's resident cell, from which we directly proceed to
computing the actual $L_p$-distance between the query vector and
that cell's set of NN candidates.


3.2.  Encoding  steps 


To show how to encode a cell's NN candidate class, suppose our
database is composed of N J-dimensional vectors from a bounded
feature space which we partition into $2^b$ uniform cells. Hence, b
bits can be used to encode this partitioning. Note that each cell
$C_k$ will be a J-dimensional hypercube; let us use $r_{k,j}$ to
denote the region in the j-th dimension intersected by $C_k$. Thus,
$r_{k,j}$ is bounded by partition points $p_j[r_{k,j}]$ and
$p_j[r_{k,j}+1]$. Hence, for each vector $v \in C_k$ and each
database element $v_i$ we compute the lower and upper bounds $l_i$
and $u_i$ of the distance between $v$ and $v_i$ as
Let us denote the least upper bound of the database elements with
respect to cell $C_k$ as

$$\mathrm{lub}_k = \min_i \{ u_i \} \qquad (9)$$

Hence, a database element $v_i$ can be eliminated as a NN of all
vectors in cell $C_k$ if its lower bound $l_i$ is greater than the
smallest upper bound $\mathrm{lub}_k$. Hence, the indices of
non-eliminated database elements comprise the cellular class,
$\mathcal{C}_k$, of cell $C_k$. That is,

$$\mathcal{C}_k = \{ i \mid l_i \le \mathrm{lub}_k \} \qquad (10)$$

Moreover, if for example, database elements 8, 12 and 19 make up
this class, we compactly encode that by setting the 8th, 12th and
19th bits of an unsigned N-bit integer to '1' and all others to
'0'. So in addition to the usual information saved for VA-F, in the
CCE method we also save the N-bit encodings for all cellular
classes.
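The offline encoding stage can be sketched as below. This is an illustration only: it assumes the L2-distance, and it assumes the bounds are taken between each exact database vector and the box of the cell (nearest and farthest point of the box), which may differ in detail from the bound formulas used in our implementation. Names and data are hypothetical.

```python
import itertools
import math

def cell_bounds(element, cell, partitions):
    """Lower/upper L2 bounds between a database element and any vector in a cell."""
    lower_sq = upper_sq = 0.0
    for x, r, points in zip(element, cell, partitions):
        lo, hi = points[r], points[r + 1]
        lower_sq += max(lo - x, 0.0, x - hi) ** 2       # 0 when x lies inside [lo, hi]
        upper_sq += max(abs(x - lo), abs(x - hi)) ** 2  # farthest edge of the cell
    return math.sqrt(lower_sq), math.sqrt(upper_sq)

def encode_classes(database, partitions):
    """Map every cell (tuple of region indices) to a bitmask of its NN candidates."""
    classes = {}
    region_ranges = [range(len(points) - 1) for points in partitions]
    for cell in itertools.product(*region_ranges):
        bounds = [cell_bounds(v, cell, partitions) for v in database]
        lub = min(upper for _, upper in bounds)   # smallest upper bound
        mask = 0
        for i, (lower, _) in enumerate(bounds):
            if lower <= lub:                      # element i cannot be eliminated
                mask |= 1 << i                    # set the i-th bit of the class
        classes[cell] = mask
    return classes

partitions = [[0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0]]
database = [(0.5, 0.5), (2.6, 1.4), (3.5, 0.2), (0.3, 1.8)]
classes = encode_classes(database, partitions)
print(format(classes[(2, 1)], "04b"))  # 0010
```

For cell (2, 1) only element 1 survives, so any query landing in that cell needs a single exact distance computation.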


3.3. Decoding steps


To find the closest database element of a query vector $v_q$, the
first step is to determine the cell, $C_q$, of the previously
partitioned feature space in which $v_q$ resides; that of course is
the one indexed by $v_q$'s bit-encoded approximation (refer to
Section 2 on how this is done). Secondly, having $C_q$ allows us
direct access to its class $\mathcal{C}_q$, that is, its candidate
NNs' indices. Lastly, the closest database element of $v_q$ is
computed as

$$NN(v_q) = \arg\min_i \{ \|v_q - v_i\|_p \mid i \in \mathcal{C}_q \} \qquad (11)$$
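A query-time sketch of these three steps follows, again assuming the L2-distance and a query inside the bounded space. The class table holds a single hypothetical entry purely for illustration; in practice it would be the table produced by the offline encoding stage.

```python
import math
from bisect import bisect_right

def locate_cell(query, partitions):
    """Region index of every component; together they index the resident cell."""
    return tuple(bisect_right(points, q) - 1 for q, points in zip(query, partitions))

def nearest_neighbor(query, database, partitions, classes):
    """Exact distance search restricted to the resident cell's candidate class."""
    mask = classes[locate_cell(query, partitions)]
    best_index, best_distance = None, math.inf
    index = 0
    while mask:
        if mask & 1:  # bit set: element `index` is a candidate NN
            distance = math.dist(query, database[index])
            if distance < best_distance:
                best_index, best_distance = index, distance
        mask >>= 1
        index += 1
    return best_index

partitions = [[0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0]]
database = [(0.5, 0.5), (2.6, 1.4), (3.5, 0.2), (0.3, 1.8)]
classes = {(2, 1): 0b0110}  # hypothetical: cell (2, 1) keeps elements 1 and 2

result = nearest_neighbor((2.8, 1.5), database, partitions, classes)
print(result)  # 1
```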

4. VQ CLASSIFIER FOR SPEAKER IDENTIFICATION

To ensure CCE was a viable alternative in real-world applications,
VQ speaker identification (SID) was used as a testbed application.
In SID, the database is the enrolled speaker models and a query is a
feature vector from an unknown speaker we wish to match with an
enrolled speaker.

Designing an optimal classifier for SID is a challenging task,
since one would like to compare feature vectors originating from
unique acoustic classes, such as quasi-periodic, noise-like, and
impulse-like speech sounds. The VQ approach obtains distinct
acoustic classes without actually associating these classes to
specific phonemes or other speech categories. This reduces the
variation in the features due to phonetic differences while
preserving the variation due to speaker differences [5]. However,
being able to accurately detect these distinct acoustic states from
the speech signal remains an active research area.

VQ uses the k-nearest neighbor clustering algorithm to find k
unique cluster centroids, known as codebooks, which are derived
from each speaker in the training data set. To recognize an
unknown speaker, each test feature vector is compared to each
enrolled speaker's codebook and the minimum distance is
determined. The test speaker is classified as the enrolled speaker
with the smallest average of these minimum distances. For
clustering and comparisons, we use the familiar Euclidean distance,
also known as the L2-distance.

5. EXPERIMENT SETUP

In this section we describe the hardware and operating system
setup, data setup and the parameters for both the VQ and CCE
implementation used in our testbed experiment.

5.1. Hardware/OS setup

Experiments were run using a 4 x 3.0 GHz AMD Opteron 8222
processor with 32 GB of RAM. The operating system was OpenSuSE
10.2 (64-bit), but the Speaker ID application was single-threaded
and compiled in 32-bit mode.

5.2. Data description

We used a pool of 2185 speakers drawn from 3 corpora:
TIMIT, 2000 NIST SRE and an in-house corpus. The in-house
corpus was conversational, with the conversations held face-to-face
and the microphone positioned on the table facing one speaker.
The data collected from this primary speaker was the only data
used in these experiments, and the secondary speaker's audio was
removed from all files. Recordings were done with a Samson
C01U Condenser USB microphone in a low noise meeting room
environment. A total of 632 speakers were collected with an
average of about 100 seconds of combined audio per speaker.

From our speaker pool, we investigated subsets of 10, 20, ...,
90, 300, 350, ..., 600 and 1500 speakers. For each speaker set
size, 30 subsets were randomly picked. For instance, we created
30 different subsets of 10 speakers, 30 different subsets of 20
speakers and so on. The statistics for the training and testing
durations are summarized in Table 1. Note that there was no
overlap between training and testing data. For each subset a single
identification session was conducted, during which the average ID
time per query vector was tracked. The timing result for speaker
sets of size N was then the pooled average query time over its
30 subsets.


Table 1. Training and Testing Statistics

         Training Duration (secs)     Testing Duration (secs)
         min    avg    std    max     min    avg    std    max
         9      69     45     165     1      6      3      10

5.3.  VQ  implementation 

Each speaker's model was composed of three separate 64-word
codebooks: Linear Predictive Cepstral Coefficients (LPCC),
LPCC-deltas, and Hamming liftered LPCC. The LPCCs were 14th
order and computed over pre-emphasized, Hamming windowed, 32
ms frames at a 16 ms frame rate. An enrolled speaker's score
against a test speaker's collection of query vectors was the
equi-weighted sum of the 3 averaged feature scores (after each had
been scaled by the minimum averaged feature score across speakers).
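The scoring rule can be sketched as follows. The sketch assumes that "scaled by the minimum" means dividing each averaged feature score by the smallest such score across enrolled speakers; the function names, feature names and toy codebooks are hypothetical.

```python
import math

def average_min_distance(queries, codebook):
    """Average, over query vectors, of the L2-distance to the closest codeword."""
    return sum(min(math.dist(q, c) for c in codebook) for q in queries) / len(queries)

def identify(test_features, speaker_models):
    """Return the enrolled speaker with the smallest equi-weighted summed score.

    test_features:  {feature name: list of query vectors}
    speaker_models: {speaker: {feature name: codebook (list of codewords)}}
    """
    totals = {speaker: 0.0 for speaker in speaker_models}
    for name, queries in test_features.items():
        averaged = {
            speaker: average_min_distance(queries, model[name])
            for speaker, model in speaker_models.items()
        }
        # Assumed scaling: divide by the best (smallest) score for this feature.
        floor = min(averaged.values()) or 1.0  # avoid dividing by zero
        for speaker, score in averaged.items():
            totals[speaker] += score / floor
    return min(totals, key=totals.get)

speaker_models = {
    "A": {"lpcc": [(0.0, 0.0), (1.0, 1.0)], "delta": [(0.0, 1.0)]},
    "B": {"lpcc": [(5.0, 5.0), (6.0, 6.0)], "delta": [(3.0, 3.0)]},
}
test_features = {"lpcc": [(0.1, 0.1), (0.9, 1.0)], "delta": [(0.2, 0.8)]}

winner = identify(test_features, speaker_models)
print(winner)  # A
```

The inner `min` over codewords is the nearest neighbor search that CCE is meant to accelerate.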

5.4.  CCE  implementation 


For each 14-dimensional feature space, we allocated 16 bits for
partitioning such that two dimensions (the 0th and one other) were
allotted 2 bits each, and the remaining dimensions were each
allotted 1 bit. This partitioning was speaker-specific. The
following example explains what we mean by this: for the j-th
dimension, speaker i's LPCC codebook determined the partitioning of
the bounded interval $[\lambda_{i,j}, \upsilon_{i,j}]$ such that
$\lambda_{i,j}$ was equal to the mean, $\mu_{i,j}$, of the j-th
components of speaker i's LPCC codewords minus 2 standard
deviations, $2\sigma_{i,j}$. Similarly, $\upsilon_{i,j}$ was equal
to $\mu_{i,j} + 2\sigma_{i,j}$. So the partitioning points for
$[\lambda_{i,j}, \upsilon_{i,j}]$ with respect to speaker i were

$$p_{i,j}[0] = \lambda_{i,j} + 0 \cdot \delta_{i,j}$$
$$\vdots \qquad (12)$$
$$p_{i,j}[2^{b_j}] = \lambda_{i,j} + 2^{b_j} \cdot \delta_{i,j}$$

and where

$$\delta_{i,j} = \frac{\upsilon_{i,j} - \lambda_{i,j}}{2^{b_j}} \qquad (13)$$

The  delta  and  lifter  codebooks’  CCEs  were  similarly  derived. 
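A sketch of the speaker-specific partitioning for one dimension is given below. It assumes the partition points are evenly spaced between the mean minus and plus two standard deviations, with the step equal to the interval width divided by the number of regions, and it uses the population standard deviation; both the choice of estimator and the toy codewords are assumptions for illustration.

```python
import statistics

def partition_points(codewords, j, bits):
    """Evenly spaced partition points for dimension j of one speaker's codebook."""
    column = [codeword[j] for codeword in codewords]
    mean = statistics.fmean(column)
    sigma = statistics.pstdev(column)              # assumed: population std. deviation
    lower, upper = mean - 2 * sigma, mean + 2 * sigma
    step = (upper - lower) / 2 ** bits             # width of one region
    return [lower + n * step for n in range(2 ** bits + 1)]

# Toy codebook with two 1-dimensional codewords: mean 2, standard deviation 1.
codewords = [(1.0,), (3.0,)]
points_2bit = partition_points(codewords, 0, 2)
points_1bit = partition_points(codewords, 0, 1)
print(points_2bit)  # [0.0, 1.0, 2.0, 3.0, 4.0]
print(points_1bit)  # [0.0, 2.0, 4.0]
```

A dimension with b bits therefore stores 2^b + 1 points, which gives 2 x 5 + 12 x 3 = 46 points for the 16-bit allocation over 14 dimensions.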

Although we originally intended CCE to be a full-search
equivalent NNS approach, our implementation was not, since the
interval $[\lambda_{i,j}, \upsilon_{i,j}]$ was not guaranteed to
encapsulate all possible values that a query vector could take.
However, this was not detrimental, as our results will demonstrate.
Furthermore, the partitioning defined by (12) and (13) required that
we modify the region-assignment criteria for those components of
query vector $v_q$ that fell outside of
$[\lambda_{i,j}, \upsilon_{i,j}]$. In particular, if
$v_{q,j} < \lambda_{i,j}$, then $v_{q,j}$ was assigned to region
$r_{q,j}[0]$. Alternatively, if $\upsilon_{i,j} < v_{q,j}$, then
$v_{q,j}$ was assigned to region $r_{q,j}[2^{b_j}-1]$. This
assignment modification raises two issues: 1) What if $v_q$ is a
representative vector from enrolled speaker i, or 2) What if it is
not? If the first case is true and if
$v_{q,j} \notin [\lambda_{i,j}, \upsilon_{i,j}]$, then the j-th
component is an outlier of enrolled speaker i, and approximating
its value as one of the endpoint regions serves to correct its
outlier status. If the second case is true, then the cellular class
of enrolled speaker i to which $v_q$ is mapped may refer to some
candidate NNs from speaker i that are false candidates. Those false
NNs will be eliminated during the actual L2-distance calculations;
the only harm done is the extra distance calculations. The timing
results will bear out whether these extra steps negate CCE as a
viable alternative to BF. A 64-bit integer was used to encode each
cell's class of candidate NN codewords.
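The modified region assignment amounts to clamping out-of-range components to the first or last region, as in the following sketch. The function name and partition values are hypothetical.

```python
from bisect import bisect_right

def assign_region(value, points):
    """Region index for value, clamped to the end regions when out of range."""
    last_region = len(points) - 2
    if value < points[0]:       # below the interval: first region
        return 0
    if value > points[-1]:      # above the interval: last region
        return last_region
    # Inside the interval; the upper endpoint itself belongs to the last region.
    return min(bisect_right(points, value) - 1, last_region)

points = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical 2-bit partition (4 regions)
regions = [assign_region(v, points) for v in (-3.0, 2.5, 4.0, 9.0)]
print(regions)  # [0, 2, 3, 3]
```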

6.  EXPERIMENT  RESULTS 

6.1.  Timing  results 

Figures  2  and  3  give  the  pooled  average  timing  results  for  the 
examined  speaker  set  sizes.  In  all  cases,  CCE  outperformed  BF — 
ranging  from  1.6  to  1.7  times  faster  for  less  than  100  enrolled 
speakers  down  to  1.1  times  faster  for  1500  speakers  (see  Figure  4). 


Fig.  2.  CCE’s  vs.  BF’s  pooled  avg  SID  query  time  per  test 
vector  for  10  to  90  enrolled  speakers. 


Fig.  3.  CCE’s  vs.  BF’s  pooled  avg  SID  query  time  per  test 
vector  for  300  to  1500  enrolled  speakers. 


Fig.  4.  CCE’s  pooled  avg  speedup  factor  over  brute  force. 

6.2.  Accuracy  results 

Since our CCE implementation was not a full-search equivalent to
BF, it was important to ensure identification accuracies were
comparable to BF. CCE does indeed maintain comparable
accuracy, with some instances performing negligibly better or
worse than BF (Figure 5).


Fig.  5.  CCE’s  vs.  BF’s  avg  pooled  SID  accuracy. 

6.3.  Memory  requirements 

Our CCE implementation required an additional 1,573,416 bytes
per enrolled speaker, broken down as: ($2^{16}$ cells) × (8 bytes /
class) × (3 feature codebooks) to encode the cellular classes, plus
(46 partition points across all 14 dimensions) × (4 bytes / float)
× (3 feature codebooks) for the partition points.
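The byte count can be checked with a few lines of arithmetic. Which two dimensions receive 2 bits does not affect the totals, so their position in the list below is arbitrary.

```python
bits_per_dimension = [2, 2] + [1] * 12           # 16 bits over 14 dimensions
cells = 2 ** sum(bits_per_dimension)             # 65,536 cells
class_bytes = cells * 8 * 3                      # 64-bit class per cell, 3 codebooks
partition_point_count = sum(2 ** b + 1 for b in bits_per_dimension)  # 46 points
point_bytes = partition_point_count * 4 * 3      # 4-byte floats, 3 codebooks
total_bytes = class_bytes + point_bytes
print(class_bytes, point_bytes, total_bytes)     # 1572864 552 1573416
```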

7.  DISCUSSION/CONCLUSIONS 

The CCE search methodology for NNS has been demonstrated to
be significantly more efficient than brute-force search in the
context of VQ SID, and in preliminary benchmarks faster than both
AESA and Vector Approximation-File. This increase in efficiency
is especially robust for speaker sets of fewer than 100, where
processing time was routinely reduced by 38 to 41%. Since larger
speaker sets may always be broken down into smaller sets, one can
easily implement the CCE approach in a manner that reliably
speeds up the search by a factor of 1.6 to 1.7. Furthermore, the
CCE approach was able to maintain virtually identical speaker
recognition accuracy when compared to BF.
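The two ways of stating the gain are related by a one-line identity: if CCE is s times faster than BF, the fraction of processing time saved is

```latex
\[
  1 - \frac{1}{s}, \qquad
  1 - \frac{1}{1.6} = 0.375 \approx 38\%, \qquad
  1 - \frac{1}{1.7} \approx 0.412 \approx 41\%.
\]
```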

Lastly,  future  research  shall  apply  the  CCE  method  to  VQ 
optimization  in  the  speech  coding  domain  and  real-time  search 
applications.  Whether  non-uniform  feature  space  partitioning  can 
increase  the  CCE’s  filtering  capacity  will  also  be  investigated. 